Predicting stock price trends by interpreting seemingly chaotic market data has long attracted both investors and researchers. Among the many methods employed to model this behavior, machine learning techniques are by far the most popular, owing to their capability to identify stock trends from massive amounts of data that capture the underlying price dynamics. Trading firms continually iterate on their models as market and economic conditions change; as a result, there are no universal equity models or evaluation standards.
In this datathon, given the wide range of datasets available to us, we decided to model the financial trends of a company by finding correlations with its recruitment needs and the economic trends of the state it is based in. Hence, the question we tackled in this datathon is: *Can we predict a company's stock prices on the basis of its job-posting activity and the economic conditions of the state in which it is based?*
Stock trading is one of the main investment activities in the business market. To maximize returns, investors have developed several stock-analysis algorithms that help them forecast the movement of stock prices. Whether a stock will rise or fall over a given period is invaluable information to investors, and predicting the direction of stock prices is particularly important for value investing.
We believe answering the question we have formulated may be a significant piece of the stock-market puzzle, and it is vital in deepening our understanding of how a state's economic trends affect the stock prices of companies based in that state. Prediction of stock prices by modeling their correlations with economic conditions and job-posting activity can therefore serve as a strong predictive tool for investors.
To analyze the relationship between stock prices, job openings, and economic parameters, we use the following datasets:
| Data | Description |
|---|---|
| jobs | Job openings data (title, company, location, category, dates, etc.) for over 400 companies from August 2007 to December 2015. |
| companies | Important details (name, scrape dates, location, tickers, sectors, etc.) on various companies. |
| econ_state | Economic data (GDP, per capita income, unemployment rate, etc.) for all 50 states and the District of Columbia, taken from 1980 – 2016. |
| geographic | Latitude and longitude data organized alphabetically by city and state. |
| financial | Time series of financial data (close price, ex-dividends, split adjustments, adjusted close price) for over 3,000 stocks on U.S. exchanges, taken from 2007 – 2016. |
Moreover, we created several sub-datasets from the above-mentioned data.
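As an illustration of one such sub-dataset, job postings can be joined to company metadata to attach each posting to a stock ticker. This is only a sketch on toy frames; the column names (`company`, `name`, `ticker`) are assumptions for illustration, not the exact schema of the real tables:

```python
import pandas as pd

# Toy stand-ins for the jobs and companies tables; real column names may differ.
jobs = pd.DataFrame({
    'company': ['Acme', 'Acme', 'Globex'],
    'title': ['Analyst', 'Engineer', 'Manager'],
})
companies = pd.DataFrame({
    'name': ['Acme', 'Globex'],
    'ticker': ['ACM', 'GLX'],
})

# Inner join: keep only postings whose company has a known ticker
jobs_with_ticker = jobs.merge(companies, how='inner',
                              left_on='company', right_on='name')
print(jobs_with_ticker[['company', 'title', 'ticker']])
```

The inner join silently drops postings from companies that are missing from the metadata table, which is the behavior we want when the downstream model needs a ticker for every row.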
In this section we perform Exploratory Data Analysis (EDA) to determine the importance of the features; we then do feature engineering based on these results.
Here we focus on the top five industries by job-opening frequency: (1) General Management and Business, (2) Accounting and Finance, (3) Restaurants and Food Services, (4) Technology, and (5) Retail. In the following bar plot, the x-axis is the unique category ID for each industry and the y-axis is the number of job posts:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df_jobs = pd.read_csv('/mnt/c/Users/nsalehi/Desktop/Datathon/jobs.csv')

# Count job posts per industry category
df_job_count = df_jobs['category_id'].dropna().value_counts().sort_index()

plt.figure(figsize=(12, 6))
bar_width = 0.4
plt.bar(np.arange(len(df_job_count)) - 0.5 * bar_width, df_job_count,
        alpha=0.5, width=bar_width, color='b', label='Number of jobs in Cat ID')
plt.xlabel('Category ID')
plt.ylabel('Number of job posts')
plt.legend()
```
We normalized the number of job posts for these five top industries and plotted them on a U.S. map based on their ZIP codes:
```python
df_zip = pd.read_csv('/mnt/c/Users/nsalehi/Desktop/Datathon/zipcodes.csv')

# Keep only the five most frequent industry categories
top5_ids = [46, 1, 122, 141, 127]
df_top5 = df_jobs.loc[df_jobs['category_id'].isin(top5_ids)]

# Count job posts per (zip, category) pair and attach coordinates
df_top5_gr = df_top5.groupby(['zip', 'category_id']).size().reset_index(name='count')
df_top5_map = df_top5_gr.merge(df_zip, how='inner', left_on='zip', right_on='ZIP')
```
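The normalization can be done by scaling each category's counts to [0, 1] within that category, so marker sizes are comparable across industries. Below is a minimal sketch on a toy frame (`demo`) with the same `category_id` and `count` columns as `df_top5_map`:

```python
import pandas as pd

# Toy stand-in for df_top5_map; one row per (zip, category) pair
demo = pd.DataFrame({
    'category_id': [46, 46, 1, 1],
    'count': [10, 40, 5, 20],
})

# Scale each category's counts by that category's maximum, so the largest
# marker in every industry ends up the same size
demo['count_norm'] = demo['count'] / demo.groupby('category_id')['count'].transform('max')
print(demo['count_norm'].tolist())  # → [0.25, 1.0, 0.25, 1.0]
```

Using a per-category maximum (via `groupby(...).transform('max')`) rather than a global maximum prevents the largest industry from visually swamping the others on the map.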
```python
import folium
from IPython.display import display

lax_cord = (33.942809, -118.404706)

# One fill color per top-5 category
cat_colors = {46: '#ffff4c', 1: '#ff4cff', 122: '#4cff4c',
              141: '#ff4c4c', 127: '#4c4cff'}

# Avoid shadowing the built-in `map`; marker radius scales with job-post count
us_map = folium.Map(location=lax_cord, zoom_start=3)
for _, row in df_top5_map.iterrows():
    folium.CircleMarker(location=[row['LAT'], row['LNG']],
                        radius=row['count'] / 100,
                        fill=True,
                        fill_color=cat_colors[row['category_id']]).add_to(us_map)
display(us_map)
```
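To connect this EDA back to the question we posed, a simple first step is to correlate a company's monthly job-post counts with its monthly closing prices. The series below are toy data; in practice they would be aggregated from the `jobs` and `financial` datasets for a single company:

```python
import pandas as pd

# Toy monthly series; real values would come from the jobs and financial
# datasets aggregated by month for one company.
job_posts = pd.Series([12, 15, 11, 18, 22, 25], name='job_posts')
close_price = pd.Series([30.0, 31.5, 30.2, 33.0, 35.5, 36.8], name='close')

# Pearson correlation between hiring activity and price
corr = job_posts.corr(close_price)
print(round(corr, 3))
```

A strong positive correlation on real data would support using job-posting activity as a feature, though correlation alone says nothing about lead-lag structure or causality.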